Antonio Lenton, Universidad de Buenos Aires – Python

VAST 2011 Challenge
Mini-Challenge 1 - Characterization of an Epidemic Spread

Authors and Affiliations:

Antonio Lenton, Universidad de Buenos Aires, antoniolenton@gmail.com

Tool(s):

This submission was developed during the course on Information Visualization, using mostly just the Python programming language, with a bit of Processing and R.

Video:

See the video here

ANSWERS:


MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.

I adjusted a regular expression to match sick people's messages.

By graphing the number of sick messages over time (that is, the messages that match my regular expression), you can see it take off suddenly just after 8am on the 18th.
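A minimal sketch of this counting step in Python. The regex shown is a hypothetical stand-in for the one I actually grew against the dataset, and the (timestamp, text) tuple format is an assumption about how the parsed data might look:

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical, simplified symptom pattern; the real regex was grown
# iteratively against the dataset and is much more specific.
SICK_RE = re.compile(r"fever|chills|headache|is killing me|so sick", re.I)

def sick_counts_per_hour(rows):
    """rows: iterable of (timestamp_str, text) tuples, e.g. from csv.reader.
    Returns a Counter keyed by (date, hour) over messages matching SICK_RE."""
    counts = Counter()
    for ts, text in rows:
        if SICK_RE.search(text):
            t = datetime.strptime(ts, "%m/%d/%Y %H:%M")
            counts[(t.date(), t.hour)] += 1
    return counts
```

Plotting these hourly counts (with matplotlib or R) is what makes the sudden 8am spike on the 18th visible.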

Then I graphed just the sick messages between 8:00 and 10am on May 18th as a heatmap, and you can see the trail of messages starting at the bridge over 610 and heading downwind.
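The actual heatmap was rendered in Processing-js; the binning step behind it can be sketched in Python like this (the grid resolution and coordinate bounds are illustrative assumptions):

```python
from collections import Counter

def grid_heatmap(points, x_min, x_max, y_min, y_max, n=100):
    """Bin (x, y) map coordinates into an n-by-n grid.
    Returns a Counter mapping (col, row) cells to message counts,
    which a renderer can turn into color intensities."""
    counts = Counter()
    for x, y in points:
        # Clamp to the last cell so points on the max edge stay in range.
        i = min(int((x - x_min) / (x_max - x_min) * n), n - 1)
        j = min(int((y - y_min) / (y_max - y_min) * n), n - 1)
        counts[(i, j)] += 1
    return counts
```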

If you display messages that report a stomach-related ailment on the 19th, you can also see the disease spreading downriver, from the same bridge over 610, about a day later.


MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.

My first thought was to visualize the tweets on the map, and start filtering by term. I used a Processing-js display to visualize this and it took me about an evening. Results matched up to the different areas of the city nicely, but were very noisy.

Next I thought of learning association rules, treating each message as a basket and each word as an item, to see which words tended to be used together. This took me a couple of evenings, including a small Django webserver to feed preprocessed rules into the Processing-js display so that it could recommend related words when you searched for a term. The results were surprising. On the one hand, "flu" was highly linked to many common names like "Dylan" or "Madison". I thought these would be the names of people who had died, but I couldn't find links between the names and "death" or other noteworthy terms. On the other hand, I found several uncommon typos like "dammage" or "covetion" giving noticeably high support to rules.
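The core of the basket-mining idea can be sketched as a pairwise co-occurrence count. This is a simplified stand-in for the real rule learner (which also computed confidence and fed a recommender), assuming messages arrive as plain strings:

```python
from collections import Counter
from itertools import combinations

def pair_supports(messages, min_support=2):
    """Treat each message as a basket of unique lowercased words and count
    how many messages each word pair co-occurs in (its support).
    Returns only pairs seen in at least min_support messages."""
    counts = Counter()
    for msg in messages:
        words = sorted(set(msg.lower().split()))  # dedupe within a message
        for pair in combinations(words, 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}
```

High-support pairs like ("dylan", "flu") are exactly the kind of link that surfaced the synthetic-grammar artifacts in the data.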

Looking through the messages filtering for some of these terms, you could tell that the dataset was synthetic: lots of messages by sick people were clearly generated by a simple grammar, as I kept finding repeated patterns and small subphrases.

At some point I decided to leverage the fact that the dataset was clearly synthetic. I adjusted a regular expression to match sick people's messages by writing a script that would find messages matching a simple, very popular phrase (like "is killing me") but not my regular expression. That allowed me to quickly grow the regex, in a couple of hours, into one with excellent sensitivity and specificity for this dataset.
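The miss-finding script amounts to a one-pass filter. A minimal sketch, assuming messages are plain strings and the candidate regex is a compiled pattern:

```python
import re

def find_misses(messages, phrase, sick_re):
    """Return messages that contain a known sick phrase but that the
    candidate regex does not yet match. Reading these shows which
    phrasings the regex still needs to cover."""
    return [m for m in messages
            if phrase in m.lower() and not sick_re.search(m)]
```

Iterating on the output of this filter (extend the regex, re-run, read the new misses) is what let the pattern converge quickly against a grammar-generated dataset.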

At this point I generated a visualization that displayed the first outbreak on the 18th, but it seemed to point at three ground zero points (close to the convention center, the hospital and the Vastopolis Dome). In hindsight, that was because I was looking at a timeslice that was too slim; at the time, I assumed instead that I was missing some manually crafted messages linking those three first hotspots to a true first case, so I kept looking.

Using this same technique, I generated regexes for messages from people at conventions, from people with sick friends or relatives (but who weren't necessarily sick themselves), from people in traffic jams, and from people who heard about the 5/17 explosion in Smogtown, to help me clear out some true negatives. I then trained a naïve Bayes classifier on a randomly picked sample of 500 messages that I labelled myself (that took me quite a while!).
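The classifier itself can be sketched as a small multinomial naïve Bayes over bag-of-words counts with add-one smoothing. The submission doesn't describe the exact implementation, so this is an illustrative stand-in, not the original code:

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial naive Bayes over bag-of-words features,
    with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            lp = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Trained on the 500 hand-labelled messages, a model like this can sweep the remaining hundreds of thousands of messages cheaply.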

I continued to cull easy true negatives by removing messages that mentioned famous people, brands, and popular unrelated hashtags and topics, using several short Python scripts, but I still found nothing. My maps would just get noisier, and no earlier, unique ground zero emerged.

I tried tracking the locations of the people who were at the right place and time (Downtown, 5/18 8:10) back a couple of days, to see if they had been in the same place at the same time somewhere, but that shed no light either. People in Vastopolis live positively Brownian lives, moving from one location to a totally different one in an apparently random fashion.
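The back-tracking check can be sketched as a search for grid cells where several of the suspected first cases were together at the same time. The (user, day, hour, cell) track format is an assumption about how the location history was preprocessed:

```python
from collections import defaultdict

def common_meeting_points(tracks, suspects, min_people=2):
    """tracks: iterable of (user, day, hour, cell) records.
    Returns the (day, hour, cell) keys where at least min_people of the
    given suspects were observed together, mapped to those users."""
    seen = defaultdict(set)
    for user, day, hour, cell in tracks:
        if user in suspects:
            seen[(day, hour, cell)].add(user)
    return {k: v for k, v in seen.items() if len(v) >= min_people}
```

For this dataset the result set stayed empty for the days before the 18th, which is what ruled out a shared earlier exposure point.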

At one point I used different colors for displaying different types of symptoms, and that brought out the fact that there are actually two different ways the disease is spreading: by air and by water. Stomach-related symptoms were close to the River Vast's banks, while the lung-related issues were clearly airborne.

When I grayed out messages from people who had already reported being sick, it became clear that the disease was contained. The whole last day is completely grayed out, with just a couple of tints of colour, most likely due to a few false positives in my filter.
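The graying-out step amounts to keeping only each user's first sick report and dimming the rest. A minimal sketch, assuming (user, timestamp, text) tuples with sortable timestamps:

```python
def first_reports_only(messages):
    """Keep only each user's earliest sick message; repeats from the same
    user would be rendered grayed out rather than as new cases."""
    seen = set()
    out = []
    for user, ts, text in sorted(messages, key=lambda m: m[1]):
        if user not in seen:
            seen.add(user)
            out.append((user, ts, text))
    return out
```

With repeat reporters dimmed, a day with few genuinely new cases, like May 20th, shows up as an almost entirely gray map.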

Then Andres showed how you could clearly see the single initial ground zero by just ramping up the time you keep each event on the map, and everything kind of slotted in.

Lucila pointed out that the truck that caused a traffic jam on the 17th was the most likely suspect for causing all this havoc.

Thanks should go to all my classmates at Information Visualization, as a fair amount of idea-bouncing went on and I can't claim all the ideas in this submission were mine. I did develop all the visualizations myself, though.

The only thing I can't understand is this next figure. What is shown there is both the number of sick messages vs. time, and the number of messages from people that have a sick friend or relative (but that aren't sick themselves). Why do so many people start texting that they have sick relatives at 2AM if the first cases don't appear until 8AM?

Hypothesis on how the infection is being transmitted.

As the video briefly explains, there are indeed two ways the disease is transmitted. Starting on the 18th at around 8:10AM, you can see the disease is airborne. The other starts to show around 2AM on the 19th, and is in the water of the River Vast.

On the afternoon of the 18th, people go home sick, which generates many messages from people who got sick that morning, and yet you never see their neighbours or relatives get sick. Although many sick people go out into the suburbs, there are no new cases reported there, so the disease is not transmitted person-to-person.

Is the outbreak contained?

Yes. As the video shows, during May 20th there are very few new cases reported. This seems to have been an environmental agent that isn't acting any longer.